Short-Text Clustering using Statistical Semantics

نویسندگان

  • Sepideh Seifzadeh
  • Ahmed K. Farahat
  • Mohamed S. Kamel
  • Fakhri Karray
چکیده

Short documents are typically represented by very sparse vectors, in the space of terms. In this case, traditional techniques for calculating text similarity results in measures which are very close to zero, since documents even the very similar ones have a very few or mostly no terms in common. In order to alleviate this limitation, the representation of short-text segments should be enriched by incorporating information about correlation between terms. In other words, if two short segments do not have any common words, but terms from the first segment appear frequently with terms from the second segment in other documents, this means that these segments are semantically related, and their similarity measure should be high. Towards achieving this goal, we employ a method for enhancing document clustering using statistical semantics. However, the problem of high computation time arises when calculating correlation between all terms. In this work, we propose the selection of a few terms, and using these terms with the Nyström method to approximate the term-term correlation matrix. The selection of the terms for the Nyström method is performed by randomly sampling terms with probabilities proportional to the lengths of their vectors in the document space. This allows more important terms to have more influence on the approximation of the term-term correlation matrix and accordingly achieves better accuracy.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Short Text Conceptualization Using a Probabilistic Knowledgebase

Most text mining tasks, including clustering and topic detection, are based on statistical methods that treat text as bags of words. Semantics in the text is largely ignored in the mining process, and mining results often have low interpretability. One particular challenge faced by such approaches lies in short text understanding, as short texts lack enough content from which statistical conclu...

متن کامل

Hybrid semantic clustering of hashtags

Clustering hashtags based on their semantics is an important problem with many applications. The uncontrolled usage of hashtags in social media, however, makes the quality of semantics and the frequency of usage vary a lot, and this poses a challenge to the current approaches which capitalize on either the lexical semantics of a hashtag (by using metadata) or the contextual semantics of a hasht...

متن کامل

A Hybrid Approach to Semantic Hashtag Clustering in Social Media

The uncontrolled usage of hashtags in social media makes them vary a lot in the quality of semantics and the frequency of usage. Such variations pose a challenge to the current approaches which capitalize on either the lexical semantics of a hashtag by using metadata or the contextual semantics of a hashtag by using the texts associated with a hashtag. This thesis presents a hybrid approach to ...

متن کامل

Exploiting Document Level Semantics in Document Clustering

Document clustering is an unsupervised machine learning method that separates a large subject heterogeneous collection (Corpus) into smaller, more manageable, subject homogeneous collections (clusters). Traditional method of document clustering works around extracting textual features like: terms, sequences, and phrases from documents. These features are independent of each other and do not cat...

متن کامل

A Comparative Analysis of Particle Swarm Optimization and K-means Algorithm For Text Clustering Using Nepali Wordnet

The volume of digitized text documents on the web have been increasing rapidly. As there is huge collection of data on the web there is a need for grouping(clustering) the documents into clusters for speedy information retrieval. Clustering of documents is collection of documents into groups such that the documents within each group are similar to each other and not to documents of other groups...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015